Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table."
These data are actually pretty tidy, so we're going to be focusing on cleaning and transformation, but these manipulations will give you some idea of how to tidy untidy data.
We are going to be using data from the R package nycflights13. There are five datasets corresponding to flights departing NYC in 2013. In R these load directly from the library, but the repository also includes CSV files we created for the purposes of the Python demo.
For this tutorial we'll be using the following Python packages: pandas, matplotlib, and seaborn. You can install these with either pip or conda.
Pandas is an extremely useful package for data manipulation in Python. Its core object is the 'DataFrame', modeled after data frames in R.
In [ ]:
from __future__ import print_function # For the python2 people
import pandas as pd # This is typically how pandas is loaded
Let's read data from a file
There are five tables we'll be using as part of the NYCFlights13 dataset
To view them, first extract the archive that comes with this repo
unzip nycflights13.zip
Now, to read them in as dataframes, we'll use the read_table function from pandas
This is a general-purpose function for reading tabular data in a text-file format. Its documentation lists many configurable options; we'll just use the defaults (which assume tab-delimited data).
In [ ]:
airlines = pd.read_table("airlines.txt")
airports = pd.read_table("airports.txt")
flights = pd.read_table("flights.txt")
planes = pd.read_table("planes.txt")
weather = pd.read_table("weather.txt")
In [ ]:
print(type(planes)) # Yup, it's a DataFrame
In [ ]:
# What does it look like?
planes # Jupyter Notebooks do some nifty formatting here
In [ ]:
# How big is it?
print(planes.shape) # Works like numpy
In [ ]:
print(planes.columns) # What are the column labels?
In [ ]:
print(planes.index) # What are the row labels?
In [ ]:
# Let's grab a column
planes['manufacturer']
In [ ]:
# Inspecting this column further
manufacturer = planes['manufacturer']
print(type(manufacturer)) # It's a Series
In [ ]:
# Indexing into Series
print("Indexing into Series: ", manufacturer[3])
# Indexing into DataFrame
print("Indexing into DataFrame: ", planes.loc[3, 'manufacturer'])
In [ ]:
third_row = planes.loc[3] # get the row labeled 3 (the fourth row)
third_row
In [ ]:
print(type(third_row))
In [ ]:
planes = planes.set_index('tailnum')
# OR
planes = pd.read_table('planes.txt', index_col=0) # Set the first column as the index
planes.loc['N10156']
In [ ]:
print(planes.iloc[3]) # Get the row at position 3 (the fourth row)
print(planes.iloc[:, 3]) # Get the column at position 3 (the fourth column)
In [ ]:
print('What are the first 5 rows?')
flights.head()
In [ ]:
print('What are the last 5 rows?')
flights.tail()
In [ ]:
print('Sample random rows')
flights.sample(3, axis=0) # Axis 0 represents the rows, axis 1, the columns
In [ ]:
print('What are the dimensions of the flights dataframe?\n')
print(flights.shape)
print('Are there any NAs in the flights dataframe?\n')
print(flights.isnull().any())
In [ ]:
print('Selecting for flights where there is complete data, what are the dimensions?\n')
print("Original Matrix Shape:", flights.shape)
null_rows = flights.isnull().any(axis=1) # Rows where any value is null
flights_complete = flights.loc[~null_rows]
print("Complete-rows shape:", flights_complete.shape)
In [ ]:
print(type(null_rows))
null_rows
The great thing about pandas is that when you pass in a Series, the order of its elements doesn't matter: pandas uses the index to align the Series to the row/column index of the DataFrame.
This is very useful when creating a boolean index from one dataframe to be used to select rows in another!
Alternatively, for removing NA values there is a dropna function that can be used.
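As a quick illustration of both points, here's a minimal sketch on a made-up DataFrame (column names are hypothetical): a boolean mask still selects the right rows after its order is scrambled, because pandas realigns it by index label, and dropna gets the same result in one call.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical data, not nycflights13)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

mask = df['a'].isnull()          # boolean Series: [False, True, False]
shuffled = mask.iloc[[2, 0, 1]]  # same labels, scrambled order
aligned = df.loc[~shuffled]      # pandas realigns the mask by index label
dropped = df.dropna()            # shortcut: drop rows containing any NA
```

Both `aligned` and `dropped` end up holding the same two complete rows.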
Now...back to flights!
In [ ]:
print('How might I obtain a summary of the original dataset?')
flights.describe() # Similar to R's 'summary'
# use include='all' to include the non-numeric columns too
Pandas allows easy application of descriptive functions along an axis.
'any', which we used earlier, is an example of this: it collapses a series of boolean values into True if any of the values are True (otherwise, False).
You can also use 'min', 'max', 'mean', 'var', 'std', and 'count'.
In [ ]:
# An example
flights['air_time'].mean() # Returns a single value
In [ ]:
subset = flights[['air_time', 'dep_delay', 'arr_delay']]
subset.mean(axis=0) # Axis 0: collapse all rows; the result's index is the original columns
If you want to apply an arbitrary function along an axis, look into the apply function
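For instance, here's a minimal sketch of apply on a toy frame (column names are made up): axis=0 collapses each column with your function, axis=1 works row by row.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Collapse each column with an arbitrary function (here: range = max - min)
col_range = df.apply(lambda s: s.max() - s.min(), axis=0)

# Or compute something per row
row_total = df.apply(lambda row: row['x'] + row['y'], axis=1)
```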
Sometimes you may want to perform some aggregate function on data by category, which is encoded in another column. Here we calculate the statistics for departure delay, grouping by origin of the flight - remember this is the greater NYC area, so there are only three origins!
In [ ]:
result = flights_complete.groupby('origin')['dep_delay'].mean()
result
In [ ]:
# What is this object?
print(type(result))
Other descriptive functions work here, like 'std', 'count', 'min', 'max'
Also: describe
In [ ]:
flights_complete.groupby('origin')['dep_delay'].describe()
You will likely need to combine datasets at some point. For simple acts of stitching two dataframes together, the pandas concat function is used.
Let's create a data frame with information on flights by United Airlines and American Airlines only, by creating two data frames via subsetting data about each airline one by one and then merging.
The main requirement is that the columns must have the same names (may be in different order).
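To see that column order doesn't matter, here's a tiny sketch on made-up frames: concat aligns columns by name, not position.

```python
import pandas as pd

a = pd.DataFrame({'x': [1], 'y': [2]})
b = pd.DataFrame({'y': [4], 'x': [3]})  # same columns, different order

stacked = pd.concat([a, b], ignore_index=True)  # columns align by name
```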
In [ ]:
print('Subsetting the dataset to have 2 dataframes')
flightsUA = flights.loc[flights.carrier == 'UA']
flightsAA = flights.loc[flights.carrier == 'AA']
print('Checking the number of rows in two dataframes')
print(flightsUA.shape[0] + flightsAA.shape[0])
In [ ]:
print('Combining two dataframes, then checking the number of rows in the resulting data frame')
flightsUAandAA = pd.concat([flightsUA,flightsAA], axis=0) # axis=1 would stitch them together horizontally
print(flightsUAandAA.shape[0])
Nothing special; just be sure the dataframes have columns with the same names and types.
In [ ]:
print('Binding 3 data frames and checking the number of rows')
allthree = pd.concat([flightsUA,flightsAA,flightsUAandAA])
allthree.shape[0]
In [ ]:
airports.head()
The airports table gives us a key! Let's merge the flights data with the airports data, using dest in flights and faa in airports.
In [ ]:
print('Merging in pandas')
flights_readdest = flights_complete.merge(airports, left_on='dest', right_on = 'faa', how='left')
flights_readdest.head()
Why did we use how='left'?
In [ ]:
len(set(airports.faa) - set(flights.dest))
There are 1357 airports in the airports table that never appear as a destination in the flights table.
Here are the different arguments for how and what they'd do: 'left' keeps every row of the left frame, 'right' keeps every row of the right frame, 'inner' keeps only keys present in both, and 'outer' keeps keys present in either.
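A minimal sketch of the four options on made-up miniature frames (keys 'a'-'d' are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'l': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'r': [4, 5, 6]})

inner = left.merge(right, on='key', how='inner')  # keys in both: b, c
lefty = left.merge(right, on='key', how='left')   # all left keys: a, b, c
outer = left.merge(right, on='key', how='outer')  # union of keys: a, b, c, d
```

Rows with no match on the other side get NaN in that side's columns.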
Well this merged dataset is nice, but do we really need all of this information?
In [ ]:
flights_readdest.columns
In [ ]:
flights_sm = flights_readdest[['origin', 'name', 'year', 'month', 'day', 'air_time']]
flights_sm.head()
In [ ]:
# Renaming columns: pass rename a dict of {old_name: new_name}
flights_sm = flights_sm.rename(columns = {'name': 'dest'})
flights_sm.head()
Since each operation gives us back a dataframe, they are easily chained:
In [ ]:
airtime = flights_complete.merge(airports, left_on='dest', right_on='faa', how='left') \
.loc[:, ['origin', 'name', 'air_time']] \
.groupby(['origin', 'name'])['air_time'] \
.mean()
print(airtime.shape)
airtime.head()
Goal: What's the longest flight from each airport, on average?
Here, 'airtime' is a little unusual because its index has two levels
- First level is the 'origin'
- Second level is the name of the destination
This is because we grouped by two variables.
Now we need to group by 'origin' and apply the 'max' function. Groupby can also work on the levels of a MultiIndex.
In [ ]:
airtime.groupby(level='origin').max()
In [ ]:
# What if we want to know where the flight goes?
rows = airtime.groupby(level='origin').idxmax() # This returns the indices in airtime where the max was found
airtime[rows] # Index by it to get the max rows
In [ ]:
pvt_airtime = airtime.unstack() # Since airtime has a hierarchical index, we can use unstack
pvt_airtime
However, often you want to pivot just a regular dataframe. I'll create one from airtime for an example:
In [ ]:
airtime_df = pd.DataFrame(airtime).reset_index()
airtime_df.head()
In [ ]:
airtime_pv = airtime_df.pivot(index='origin',
columns='name',
values='air_time')
airtime_pv
In [ ]:
weather.head()
In [ ]:
print(flights_complete.columns.intersection(weather.columns)) # What columns do they share?
In [ ]:
flights_weather = flights_complete.merge(weather,
on=["year", "month","day","hour", "origin"])
print(flights_complete.shape)
print(flights_weather.shape)
flights_weather has fewer rows. The default behavior of merge is 'inner', so this means there are some flight year/month/day/hour/origin combos for which we don't have a weather entry.
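A handy way to see exactly which rows fall out of an inner merge is merge's indicator=True option. Here's a hedged sketch on made-up miniature frames (the column names are invented, not the real flights/weather schema):

```python
import pandas as pd

flights_toy = pd.DataFrame({'hour': [1, 2, 3], 'dep_delay': [10, 0, 5]})
weather_toy = pd.DataFrame({'hour': [1, 3], 'wind': [12.0, 8.0]})

# indicator=True adds a '_merge' column saying where each row matched
merged = flights_toy.merge(weather_toy, on='hour', how='left', indicator=True)
no_weather = merged[merged['_merge'] == 'left_only']  # flights lacking a weather row
```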
In [ ]:
# Let's grab flights+weather where the delay was greater than 200 minutes
flights_weather_posdelays = flights_weather.loc[flights_weather.dep_delay > 200]
flights_weather_posdelays.shape
In [ ]:
# Anything unusual about these flights?
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
plt.hist(flights_weather.dropna().wind_gust, 30, range=(0, 50), density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().wind_gust, 30, range=(0, 50), density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Wind Gust')
plt.figure()
plt.hist(flights_weather.dropna().pressure, 30, density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().pressure, 30, density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Pressure')
plt.figure()
plt.hist(flights_weather.dropna().hour, 30, density=True, label='normal', alpha=.7)
plt.hist(flights_weather_posdelays.dropna().hour, 30, density=True, label='delayed', alpha=.7)
plt.legend(loc='best')
plt.title('Hour')
In [ ]:
flights_weather.sort_values('dep_delay').head(10)
In [ ]:
flights_weather.sort_values('dep_delay', ascending=False).head(10)
In [ ]:
flights_complete.dest.str.lower().head() # For string columns, use .str to access string methods
In [ ]:
flights_complete.dest.str.upper().head()
In [ ]:
flights_complete.head()
In [ ]:
day_delay = pd.melt(flights_complete, id_vars=['hour', 'time_hour'], value_vars=['dep_delay', 'arr_delay'], var_name='type_of_delay')
day_delay
In [ ]:
plt.figure()
sns.stripplot(x='hour', y='value', hue='type_of_delay', data=day_delay)
In [ ]:
day_delay_first = day_delay.drop_duplicates('time_hour', keep='first')
day_delay_first.head()
In [ ]:
flights.isnull().sum(axis=0)
In [ ]:
flights_incomplete = flights.loc[flights.isnull().any(axis=1)]
flights_incomplete.shape
Do flights with NA departure time also have an NA departure delay?
In [ ]:
pd.crosstab(
index=flights_incomplete.dep_time.isnull(), # Series of bool values
columns=flights_incomplete.dep_delay.isnull() # series of bool values
)
Yes